Abstract

This paper presents a pre-processing and a distance which improve the performance of
machine learning algorithms working on independent and identically distributed stochastic
processes. We introduce a novel non-parametric approach to represent random variables which
splits apart dependency and distribution without losing any information. We also propound
an associated metric leveraging this representation and its statistical estimate. Besides
experiments on synthetic datasets, the benefits of our contribution are illustrated through
the example of clustering financial time series, for instance prices from the credit default
swaps market. Results are available on the website www.datagrapple.com and an IPython
Notebook tutorial is available at www.datagrapple.com/Tech for reproducible research.


1 Introduction 


Machine learning on time series is a booming field and, as such, plenty of representations,
transformations, normalizations, metrics and other divergences are put at the disposal of the
practitioner. A further consequence of the recent advances in time series mining is that it is
difficult to have a sober look at the state of the art since many papers state contradictory
claims, as described in (Ding et al., 2008). To be fair, we should mention that when data,
pre-processing steps, distances and algorithms are combined together, they have an intricate
behaviour making it difficult to draw unanimous conclusions, especially in a fast-paced
environment. Restricting the scope of time series to independent and identically distributed
(i.i.d.) stochastic processes, we propound a method which, contrary to many of its
counterparts, is mathematically grounded with respect to the clustering task defined in
subsection 1.1. The representation we present in Section 2 exploits a property similar to the
seminal result of copula theory, namely Sklar’s theorem (Sklar, 1959). This approach leverages
the specificities of random variables and in this way solves several shortcomings of more
classical data pre-processing and distances that will be detailed in subsection 1.2. Section 3
is dedicated to experiments on synthetic and real datasets to illustrate the benefits of our
method, which relies on the hypothesis of i.i.d. sampling of the random variables. Synthetic
time series are generated by a simple model yielding correlated random variables following
different distributions. The presented approach is also applied to financial time series from
the credit default swaps market, whose price dynamics are usually modelled by random walks
according to the efficient-market hypothesis (Fama, 1965). This dataset seems more interesting
than stocks as credit default swaps are often considered as a gauge of investors’ fear; thus
their time series are subject to more violent moves and may provide more distributional
information than the ones from the stock market. We have made our detailed experiments (cf.
Machine Tree on the website www.datagrapple.com) and Python code available
(www.datagrapple.com/Tech) for reproducible research. Finally, we conclude the paper with a
discussion on the method and we propound future research directions.


1.1 Motivation and goal of study 


Machine learning methodology usually consists of several pre-processing steps aiming at
cleaning data and preparing them for being fed to a battery of algorithms. Data scientists
have the daunting mission of choosing, among a profuse literature, the best possible
combination of pre-processing, dissimilarity measure and algorithm to solve the task at hand.
In this article, we provide both a pre-processing and a distance for studying i.i.d. random
processes which are compatible with basic machine learning algorithms.

Many statistical distances exist to measure the dissimilarity of two random variables, and therefore 
two i.i.d. random processes. Such distances can be roughly classified in two families: 


1. distributional distances, for instance (Ryabko, 2010), (Khaleghi et al., 2012) and
   (Henderson et al., 2015), which focus on dissimilarity between probability distributions
   and quantify divergences in marginal behaviours,

2. dependence distances, such as the distance correlation or copula-based kernel dependency
   measures (Poczos et al., 2012), which focus on the joint behaviours of random variables,
   generally ignoring their distribution properties.


However, we may want to be able to discriminate random variables both on distribution and
dependence. This can be motivated, for instance, by the study of financial assets returns: are
two perfectly correlated random variables (assets returns), one being normally distributed and
the other one following a heavy-tailed distribution, similar? From a risk perspective, the
answer is no (Kelly and Jiang, 2014), hence the propounded distance of this article. We
illustrate its benefits through clustering, a machine learning task which primarily relies on
the metric space considered (data representation and associated distance). Besides, clustering
has found application in finance, e.g. (Tola et al., 2008), which gives us a framework for
benchmarking on real data.


Our objective is therefore to obtain a good clustering of random variables based on an
appropriate distance that is simple enough to be used with basic clustering algorithms, e.g.
Ward hierarchical clustering (Ward, 1963), k-means++ (Arthur and Vassilvitskii, 2007),
affinity propagation (Frey and Dueck, 2007).


By clustering we mean the task of grouping sets of objects in such a way that objects in the same 
cluster are more similar to each other than those in different clusters. More specifically, a cluster of 
random variables should gather random variables with common dependence between them and with 
a common distribution. Two clusters should differ either in the dependency between their random 
variables or in their distributions. 


A good clustering is a partition of the data that must be stable to small perturbations of the
dataset. “Stability of some kind is clearly a desirable property of clustering methods”
(Carlsson and Mémoli, 2010). In the case of random variables, these small perturbations can be
obtained from resampling (Levine and Domany, 2001), (Monti et al., 2003), (Lange et al., 2004)
in the spirit of the bootstrap method since it preserves the statistical properties of the
initial sample (Efron, 1979).


Yet, practitioners and researchers pinpoint that state-of-the-art results of clustering
methodology applied to financial time series are very sensitive to perturbations (Lemieux et
al., 2014). The observed instability may result from a poor representation of these time
series, and thus clusters may not capture all the underlying information.


1.2 Shortcomings of a standard machine learning approach 

A naive but often used distance between random variables to measure similarity and to perform
clustering is the $L_2$ distance $\mathbb{E}[(X - Y)^2]$. Yet, this distance is not suited to
our task.

Figure 1: Probability density functions of Gaussians $\mathcal{N}(-5, 1)$ and
$\mathcal{N}(5, 1)$ (in green), Gaussians $\mathcal{N}(-5, 3)$ and $\mathcal{N}(5, 3)$ (in
red), and Gaussians $\mathcal{N}(-5, 10)$ and $\mathcal{N}(5, 10)$ (in blue). Green, red and
blue Gaussians are equidistant using $L_2$ geometry on the parameter space $(\mu, \sigma)$.


Example 1 (Distance $L_2$ between two Gaussians). Let $(X, Y)$ be a bivariate Gaussian vector,
with $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$, $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ and whose
correlation is $\rho(X, Y) \in [-1, 1]$. We obtain

$$\mathbb{E}[(X - Y)^2] = (\mu_X - \mu_Y)^2 + (\sigma_X - \sigma_Y)^2 + 2\sigma_X\sigma_Y\,(1 - \rho(X, Y)).$$

Now, consider the following values for correlation:

• $\rho(X, Y) = 0$, so $\mathbb{E}[(X - Y)^2] = (\mu_X - \mu_Y)^2 + \sigma_X^2 + \sigma_Y^2$.
  The variables are independent (since uncorrelated and jointly normally distributed), thus we
  must discriminate on distribution information. Assume $\mu_X = \mu_Y$ and
  $\sigma_X = \sigma_Y$. For $\sigma_X = \sigma_Y \gg 1$, we obtain
  $\mathbb{E}[(X - Y)^2] \gg 1$ instead of the distance 0 expected from comparing two equal
  Gaussians.

• $\rho(X, Y) = 1$, so $\mathbb{E}[(X - Y)^2] = (\mu_X - \mu_Y)^2 + (\sigma_X - \sigma_Y)^2$.
  Since the variables are perfectly correlated, we must discriminate on distributions. We
  actually compare them with an $L_2$ metric on the mean × standard deviation half-plane.
  However, this is not an appropriate geometry for comparing two Gaussians (Costa et al.,
  2014). For instance, if $\sigma_X = \sigma_Y = \sigma$, we find
  $\mathbb{E}[(X - Y)^2] = (\mu_X - \mu_Y)^2$ for any value of $\sigma$. As $\sigma$ grows,
  the probabilities attached by the two Gaussians to a given interval grow similar (cf. Fig.
  1), yet this increasing similarity is not taken into account by this $L_2$ distance.
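The two pathologies of Example 1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `l2_closed_form` and `l2_monte_carlo` are illustrative and not part of the paper's code. It evaluates the closed form for jointly Gaussian variables and compares it with a Monte Carlo estimate.

```python
import numpy as np

def l2_closed_form(mu_x, sigma_x, mu_y, sigma_y, rho):
    """E[(X - Y)^2] for a bivariate Gaussian vector (X, Y) with correlation rho."""
    return ((mu_x - mu_y) ** 2
            + (sigma_x - sigma_y) ** 2
            + 2.0 * sigma_x * sigma_y * (1.0 - rho))

def l2_monte_carlo(mu_x, sigma_x, mu_y, sigma_y, rho, n=200_000, seed=0):
    """Monte Carlo estimate of E[(X - Y)^2] from n simulated pairs."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    x = mu_x + sigma_x * z1
    # rho * z1 + sqrt(1 - rho^2) * z2 is standard normal with correlation rho to z1
    y = mu_y + sigma_y * (rho * z1 + np.sqrt(1.0 - rho ** 2) * z2)
    return float(np.mean((x - y) ** 2))

# Two equal, independent Gaussians with a large standard deviation are "far apart"
same_but_far = l2_closed_form(0.0, 10.0, 0.0, 10.0, rho=0.0)
# Perfectly correlated Gaussians: the distance ignores the common sigma
narrow = l2_closed_form(-5.0, 1.0, 5.0, 1.0, rho=1.0)
wide = l2_closed_form(-5.0, 10.0, 5.0, 10.0, rho=1.0)
```

Here `same_but_far` is large although both variables share the same distribution, while `narrow` and `wide` coincide although the wide Gaussians overlap much more.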


$\mathbb{E}[(X - Y)^2]$ considers both dependence and distribution information of the random
variables, but not in a relevant way with respect to our task. Yet, we will benchmark against
this distance since other more sophisticated distances on time series such as dynamic time
warping (Berndt and Clifford, 1994) and representations such as wavelets (Percival and Walden,
2006) or SAX (Lin et al., 2003) were explicitly designed to handle temporal patterns which are
nonexistent in i.i.d. random processes.


2 A generic representation for random variables 


Our purpose is to introduce a new data representation and a suitable distance which takes into
account both distributional proximities and joint behaviours.

Figure 2: ArcelorMittal and Société générale prices ($T$ observations
$(X_1^t, X_2^t)_{t=1}^T$ from $(X_1, X_2) \in \mathcal{V}^2$) are projected on dependence
$\oplus$ distribution space; $(G_{X_1}(X_1), G_{X_2}(X_2)) \in \mathcal{U}^2$ encode the
dependence between $X_1$ and $X_2$ (a perfect correlation would be represented by a sharp
diagonal on the scatterplot); $(G_{X_1}, G_{X_2})$ are the margins (their log-densities are
displayed above), notice their heavy-tailed exponential distribution (especially for
ArcelorMittal).


2.1 A representation preserving total information 

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. $\Omega$ is the sample space,
$\mathcal{F}$ is the $\sigma$-algebra of events, and $\mathbb{P}$ is the probability measure.
Let $\mathcal{V}$ be the space of all continuous real-valued random variables defined on
$(\Omega, \mathcal{F}, \mathbb{P})$. Let $\mathcal{U}$ be the space of random variables
following a uniform distribution on $[0, 1]$ and $\mathcal{G}$ be the space of absolutely
continuous cumulative distribution functions (cdf).

Definition 1 (The copula transform). Let $X = (X_1, \ldots, X_N) \in \mathcal{V}^N$ be a
random vector with cdfs $G_X = (G_{X_1}, \ldots, G_{X_N}) \in \mathcal{G}^N$. The random
vector $G_X(X) = (G_{X_1}(X_1), \ldots, G_{X_N}(X_N)) \in \mathcal{U}^N$ is known as the
copula transform.

Property 1 (Uniform margins of the copula transform). $G_{X_i}(X_i)$, $1 \le i \le N$, are
uniformly distributed on $[0, 1]$.

Proof. $x = G_{X_i}(G_{X_i}^{-1}(x)) = \mathbb{P}(X_i \le G_{X_i}^{-1}(x)) = \mathbb{P}(G_{X_i}(X_i) \le x)$. □
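Property 1 can be illustrated by simulation. The sketch below is an assumption-laden example rather than the paper's code: it takes an exponential margin, whose cdf is known in closed form, applies that cdf to its own samples, and computes the first two moments of the result, which should be close to those of a uniform variable on [0, 1] (mean 1/2, variance 1/12).

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0                                      # illustrative parameter of the margin

# X ~ Exponential(rate), with cdf G_X(x) = 1 - exp(-rate * x)
x = rng.exponential(scale=1.0 / rate, size=100_000)

# Copula transform of a single margin: U = G_X(X)
u = -np.expm1(-rate * x)

u_mean = float(u.mean())                        # compare with 1/2
u_var = float(u.var())                          # compare with 1/12
```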

We define the following representation of random vectors that actually splits the joint behaviours of 
the marginal variables from their distributional information. 

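Definition 2 (dependence $\oplus$ distribution space projection). Let $\mathcal{T}$ be a
mapping which transforms $X = (X_1, \ldots, X_N)$ into its generic representation, an element
of $\mathcal{U}^N \times \mathcal{G}^N$ representing $X$, defined as follows

$$\mathcal{T} : \mathcal{V}^N \to \mathcal{U}^N \times \mathcal{G}^N, \qquad X \mapsto (G_X(X), G_X). \tag{1}$$

Property 2. $\mathcal{T}$ is a bijection.

Proof. $\mathcal{T}$ is surjective as any element $(U, G) \in \mathcal{U}^N \times \mathcal{G}^N$
has the fiber $G^{-1}(U)$. $\mathcal{T}$ is injective as $(U_1, G_1) = (U_2, G_2)$ a.s. in
$\mathcal{U}^N \times \mathcal{G}^N$ implies that they have the same cdf $G = G_1 = G_2$ and,
since $U_1 = U_2$ a.s., it follows that $G^{-1}(U_1) = G^{-1}(U_2)$ a.s. □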


This result replicates the seminal result of copula theory, namely Sklar’s theorem (Sklar,
1959), which asserts one can split the dependency and distribution apart without losing any
information. Fig. 2 illustrates this projection for $N = 2$.
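The projection and its inverse can be sketched on a one-dimensional toy case. The example below is only an illustration under stated assumptions: the margin is exponential with a known cdf `cdf_x` and quantile function `quantile_x` (both names are hypothetical), so that the pair (U, G) can be mapped back to X exactly. It also shows that Y = sqrt(X) shares the dependence coordinate of X while having a different margin.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0                                   # illustrative parameter

def cdf_x(v):
    """G_X for X ~ Exponential(rate)."""
    return -np.expm1(-rate * v)

def quantile_x(u):
    """G_X^{-1}, the inverse of cdf_x on [0, 1)."""
    return -np.log1p(-u) / rate

def cdf_y(v):
    """G_Y for Y = sqrt(X): P(sqrt(X) <= v) = G_X(v^2) for v >= 0."""
    return cdf_x(v ** 2)

x = rng.exponential(scale=1.0 / rate, size=1_000)
y = np.sqrt(x)

u = cdf_x(x)                 # dependence coordinate of X
x_back = quantile_x(u)       # inverse mapping: G_X^{-1}(U) recovers X

u_y = cdf_y(y)               # dependence coordinate of Y, equal to that of X
```

The pair `(u, cdf_x)` and the pair `(u_y, cdf_y)` differ only in their second component, which is where the distributional information lives.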


2.2 A distance between random variables 

We leverage the propounded representation to build a suitable yet simple distance between random 
variables which is invariant under diffeomorphism. 

Definition 3 (Distance $d_\theta$ between two random variables). Let $\theta \in [0, 1]$. Let
$(X, Y) \in \mathcal{V}^2$. Let $G = (G_X, G_Y)$, where $G_X$ and $G_Y$ are respectively the
marginal cdfs of $X$ and $Y$. We define the following distance

$$d_\theta^2(X, Y) = \theta\, d_1^2(G_X(X), G_Y(Y)) + (1 - \theta)\, d_0^2(G_X, G_Y), \tag{2}$$

where

$$d_1^2(G_X(X), G_Y(Y)) = 3\,\mathbb{E}\big[\,|G_X(X) - G_Y(Y)|^2\,\big] \tag{3}$$

and

$$d_0^2(G_X, G_Y) = \frac{1}{2} \int_{\mathbb{R}} \left( \sqrt{\frac{dG_X}{d\lambda}} - \sqrt{\frac{dG_Y}{d\lambda}} \right)^2 d\lambda. \tag{4}$$

In particular, $d_0 = \sqrt{1 - BC}$ is the Hellinger distance related to the Bhattacharyya
(1/2-Chernoff) coefficient $BC$ upper bounding the Bayes’ classification error. To quantify
distribution dissimilarity, $d_0$ is used rather than the more general $\alpha$-Chernoff
divergences since it satisfies Properties 3, 4 and 5 (significant for practitioners). In
addition, $d_0$ can thus be efficiently implemented as a scalar product.
$d_1 = \sqrt{(1 - \rho_S)/2}$ is a distance correlation measuring statistical dependence
between two random variables, where $\rho_S$ is the Spearman’s correlation between $X$ and
$Y$. Notice that $d_1$ can be expressed by using the copula $C : [0, 1]^2 \to [0, 1]$
implicitly defined by the relation $G(X, Y) = C(G_X(X), G_Y(Y))$ since
$\rho_S(X, Y) = 12 \int_0^1 \int_0^1 C(u, v)\, du\, dv - 3$ (Fredricks and Nelsen, 2007).
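The link between the dependence part $d_1$ and Spearman’s correlation can be made explicit with a short derivation. It only uses the facts that $U = G_X(X)$ and $V = G_Y(Y)$ are uniform on $[0, 1]$ (so $\mathbb{E}[U] = 1/2$, $\mathbb{E}[U^2] = 1/3$, $\mathrm{Var}(U) = 1/12$) and that $\rho_S$ is the Pearson correlation of $U$ and $V$:

```latex
\begin{align*}
d_1^2 &= 3\,\mathbb{E}\big[(U - V)^2\big]
       = 3\big(\mathbb{E}[U^2] + \mathbb{E}[V^2] - 2\,\mathbb{E}[UV]\big)
       = 2 - 6\,\mathbb{E}[UV], \\
\rho_S &= \frac{\mathbb{E}[UV] - \tfrac{1}{4}}{\tfrac{1}{12}} = 12\,\mathbb{E}[UV] - 3
       \quad\Longrightarrow\quad \mathbb{E}[UV] = \frac{\rho_S + 3}{12}, \\
d_1^2 &= 2 - \frac{\rho_S + 3}{2} = \frac{1 - \rho_S}{2}.
\end{align*}
```

The factor 3 in the definition of $d_1^2$ is thus what normalizes the dependence part to $[0, 1]$.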


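Example 2 (Distance $d_\theta$ between two Gaussians). Let $(X, Y)$ be a bivariate Gaussian
vector, with $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$, $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$
and $\rho(X, Y) = \rho$. We obtain,

$$d_\theta^2(X, Y) = \theta\, \frac{1 - \rho_S}{2} + (1 - \theta) \left( 1 - \sqrt{\frac{2\sigma_X\sigma_Y}{\sigma_X^2 + \sigma_Y^2}}\; e^{-\frac{1}{4} \frac{(\mu_X - \mu_Y)^2}{\sigma_X^2 + \sigma_Y^2}} \right).$$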

Remember that for perfectly correlated Gaussians ($\rho = \rho_S = 1$), we want to
discriminate on their distributions. We can observe that

• for $\sigma_X, \sigma_Y \to +\infty$, then $d_0(X, Y) \to 0$, which alleviates a main
  shortcoming of the basic $L_2$ distance, which diverges to $+\infty$ in this case;

• if $\mu_X \neq \mu_Y$, for $\sigma_X, \sigma_Y \to 0$, then $d_0(X, Y) \to 1$, its maximum
  value, i.e. two Gaussians cannot be more remote from each other than two different Dirac
  delta functions.

We will refer to the use of this distance as the generic parametric representation (GPR)
approach. GPR distance is a fast and good proxy for distance $d_\theta$ when the first two
moments $\mu$ and $\sigma$ predominate. Nonetheless, for datasets which contain heavy-tailed
distributions, GPR fails to capture this information.
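The Gaussian closed form behind the GPR approach is short enough to be written out. The following is a minimal sketch using only the Python standard library; the function name `gpr_distance` and its signature are illustrative assumptions, and the Spearman correlation `rho_s` is taken as an input rather than estimated.

```python
import math

def gpr_distance(mu_x, sigma_x, mu_y, sigma_y, rho_s, theta=0.5):
    """d_theta between two Gaussians, given their Spearman correlation rho_s."""
    var_sum = sigma_x ** 2 + sigma_y ** 2
    # Bhattacharyya coefficient between N(mu_x, sigma_x^2) and N(mu_y, sigma_y^2)
    bc = (math.sqrt(2.0 * sigma_x * sigma_y / var_sum)
          * math.exp(-0.25 * (mu_x - mu_y) ** 2 / var_sum))
    d1_sq = (1.0 - rho_s) / 2.0          # dependence part
    d0_sq = max(0.0, 1.0 - bc)           # squared Hellinger distance, clipped at 0
    return math.sqrt(theta * d1_sq + (1.0 - theta) * d0_sq)

# Perfectly correlated Gaussians compared on distribution only (theta = 0)
d_wide = gpr_distance(-5.0, 1e4, 5.0, 1e4, rho_s=1.0, theta=0.0)     # very spread out
d_sharp = gpr_distance(-5.0, 1e-3, 5.0, 1e-3, rho_s=1.0, theta=0.0)  # nearly Dirac masses
```

With these inputs `d_wide` is close to 0 and `d_sharp` reaches the maximum value 1, matching the two limiting cases listed above.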

Property 3. Let $\theta \in [0, 1]$. The distance $d_\theta$ verifies $0 \le d_\theta \le 1$.

Proof. Let $\theta \in [0, 1]$. We have

(i) $0 \le d_0 \le 1$, property of the Hellinger distance;

(ii) $0 \le d_1 \le 1$, since $-1 \le \rho_S \le 1$.

Finally, by convex combination, $0 \le d_\theta \le 1$. □

Property 4. For $0 < \theta < 1$, $d_\theta$ is a metric.

Proof. Let $(X, Y) \in \mathcal{V}^2$. For $0 < \theta < 1$, the only non-trivial property to
verify for $d_\theta$ to be a metric is the separation axiom

(i) $X = Y$ a.s. $\Rightarrow d_\theta(X, Y) = 0$:
    $X = Y$ a.s. $\Rightarrow d_1(G_X(X), G_Y(Y)) = d_0(G_X, G_Y) = 0$, and thus
    $d_\theta(X, Y) = 0$,

(ii) $d_\theta(X, Y) = 0 \Rightarrow X = Y$ a.s.:
    $d_\theta(X, Y) = 0 \Rightarrow d_1(G_X(X), G_Y(Y)) = 0$ and $d_0(G_X, G_Y) = 0
    \Rightarrow G_X(X) = G_Y(Y)$ a.s. and $G_X = G_Y$. Since $G$ is absolutely continuous, it
    follows $X = Y$ a.s.

Notice that for $\theta \in \{0, 1\}$, this property does not hold. Let $U \in \mathcal{V}$,
$U \sim \mathcal{U}[0, 1]$. $U \neq 1 - U$ but $d_0(U, 1 - U) = 0$. Let $V \in \mathcal{V}$.
$V \neq 2V$ but $d_1(V, 2V) = 0$. □


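Property 5 (Diffeomorphism invariance). Let $h : \mathcal{V} \to \mathcal{V}$ be a
diffeomorphism. Let $(X, Y) \in \mathcal{V}^2$. Distance $d_\theta$ is invariant under
diffeomorphism, i.e.

$$d_\theta(h(X), h(Y)) = d_\theta(X, Y). \tag{5}$$

Proof. From the definition, we have

$$d_0^2(h(X), h(Y)) = 1 - \int_{\mathbb{R}} \sqrt{\frac{dG_{h(X)}}{d\lambda}(\lambda)\, \frac{dG_{h(Y)}}{d\lambda}(\lambda)}\; d\lambda, \tag{6}$$

and since

$$\frac{dG_{h(X)}}{d\lambda}(\lambda) = \frac{1}{|h'(h^{-1}(\lambda))|}\, \frac{dG_X}{d\lambda}\big(h^{-1}(\lambda)\big), \tag{7}$$

we obtain

$$d_0^2(h(X), h(Y)) = 1 - \int_{\mathbb{R}} \frac{1}{|h'(h^{-1}(\lambda))|} \sqrt{\frac{dG_X}{d\lambda}\, \frac{dG_Y}{d\lambda}}\big(h^{-1}(\lambda)\big)\, d\lambda = d_0^2(X, Y). \tag{8}$$

In addition, $\forall x \in \mathbb{R}$, we have

$$G_{h(X)}(h(x)) = \mathbb{P}[h(X) \le h(x)] = \begin{cases} \mathbb{P}[X \le x] = G_X(x), & \text{if } h \text{ increasing} \\ 1 - \mathbb{P}[X \le x] = 1 - G_X(x), & \text{otherwise} \end{cases} \tag{9}$$

which implies that

$$d_1^2(h(X), h(Y)) = 3\,\mathbb{E}\big[\,|G_{h(X)}(h(X)) - G_{h(Y)}(h(Y))|^2\,\big] = 3\,\mathbb{E}\big[\,|G_X(X) - G_Y(Y)|^2\,\big] = d_1^2(X, Y). \tag{10}$$

Finally, we obtain Property 5 by definition of $d_\theta$. □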

Thus, $d_\theta$ is invariant under monotonic transformations, a desirable property as it
ensures insensitivity to scaling (e.g. choice of units) or measurement scheme (e.g. device,
mathematical modelling) of the underlying phenomenon.
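The invariance of the dependence part is easy to observe on samples, because monotonic transformations either preserve or reverse ranks. The sketch below is an illustration assuming NumPy; it uses the rank-based quantity 3 Σ (rank(x_t) - rank(y_t))² / (T (T² - 1)), which equals (1 - ρ_S)/2 for the sample Spearman correlation ρ_S when there are no ties. The helper names `ranks` and `d1_squared` are hypothetical.

```python
import numpy as np

def ranks(a):
    """Ranks 1..T of the entries of a (assumes no ties)."""
    return np.argsort(np.argsort(a)) + 1

def d1_squared(x, y):
    """Rank-based dependence part: 3 * sum((rank_x - rank_y)^2) / (T (T^2 - 1))."""
    t = len(x)
    return 3.0 * float(np.sum((ranks(x) - ranks(y)) ** 2)) / (t * (t ** 2 - 1))

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = 0.6 * x + 0.8 * rng.standard_normal(500)     # correlated with x

base = d1_squared(x, y)
increasing = d1_squared(np.exp(x), np.exp(y))    # h(v) = exp(v), increasing
decreasing = d1_squared(-x ** 3, -y ** 3)        # h(v) = -v^3, decreasing
```

All three values agree: an increasing map leaves every rank unchanged, and a decreasing map sends rank r to T + 1 - r for both series, which leaves the squared rank differences unchanged.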


2.3 A non-parametric statistical estimation of $d_\theta$


To apply the propounded distance $d_\theta$ on sampled data without parametric assumptions,
we have to define its statistical estimate $\tilde{d}_\theta$ working on realizations of the
i.i.d. random variables. Distance $d_1$, working with continuous uniform distributions, can be
approximated by normalized rank statistics yielding discrete uniform distributions, in fact
coordinates of the multivariate empirical copula (Deheuvels, 1979), which is a non-parametric
estimate converging uniformly toward the underlying copula (Deheuvels, 1981). Distance $d_0$,
working with densities, can be approximated by using its discrete form working on histogram
density estimates.


Definition 4 (The empirical copula transform). Let $X^t = (X_1^t, \ldots, X_N^t)$,
$t = 1, \ldots, T$, be $T$ observations from a random vector $X = (X_1, \ldots, X_N)$ with
continuous margins $G_X = (G_{X_1}, \ldots, G_{X_N})$. Since one cannot directly obtain the
corresponding copula observations $(G_{X_1}(X_1^t), \ldots, G_{X_N}(X_N^t))$ without knowing
a priori $G_X$, one can instead estimate the $N$ empirical margins
$G_{X_i}^T(x) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}(X_i^t \le x)$ to obtain $T$ empirical
observations $(G_{X_1}^T(X_1^t), \ldots, G_{X_N}^T(X_N^t))$, which are thus related to
normalized rank statistics as $G_{X_i}^T(X_i^t) = X_i^{(t)} / T$, where $X_i^{(t)}$ denotes
the rank of observation $X_i^t$.
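The empirical copula transform amounts to replacing each observation by its normalized rank within its own series. A minimal sketch, assuming NumPy, a data array of shape (T, N) with one column per variable, and no ties (the function name is illustrative):

```python
import numpy as np

def empirical_copula_transform(observations):
    """Map a (T, N) array to normalized ranks rank / T, column by column."""
    observations = np.asarray(observations, dtype=float)
    t = observations.shape[0]
    # argsort of argsort gives the 0-based rank of each entry within its column
    ranks = np.argsort(np.argsort(observations, axis=0), axis=0) + 1
    return ranks / t

# Three observations of two variables measured on very different scales
sample = np.array([[0.3, 10.0],
                   [0.1, 30.0],
                   [0.2, 20.0]])
pseudo_obs = empirical_copula_transform(sample)
```

Each column of `pseudo_obs` takes its values in {1/T, 2/T, ..., 1}, whatever the scale or distribution of the original column.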
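Definition 5 (Empirical distance). Let $(X^t)_{t=1}^T$ and $(Y^t)_{t=1}^T$ be $T$
realizations of real-valued random variables $X, Y \in \mathcal{V}$ respectively. An empirical
distance between realizations of random variables can be defined by

$$\tilde{d}_\theta^2\big((X^t)_{t=1}^T, (Y^t)_{t=1}^T\big) \overset{\text{a.s.}}{=} \theta\, \tilde{d}_1^2 + (1 - \theta)\, \tilde{d}_0^2, \tag{11}$$

where

$$\tilde{d}_1^2 = \frac{3}{T(T^2 - 1)} \sum_{t=1}^{T} \big( X^{(t)} - Y^{(t)} \big)^2 \tag{12}$$

and

$$\tilde{d}_0^2 = \frac{1}{2} \sum_{k=-\infty}^{+\infty} \left( \sqrt{g_X^h(hk)} - \sqrt{g_Y^h(hk)} \right)^2, \tag{13}$$

$h$ being here a suitable bandwidth, and
$g_X^h(x) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\big(\lfloor \tfrac{x}{h} \rfloor h \le X^t < (\lfloor \tfrac{x}{h} \rfloor + 1) h\big)$
being a density histogram estimating the pdf $g_X$ from $T$ realizations of random variable
$X \in \mathcal{V}$.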
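The empirical distance can be sketched directly from ranks and histograms. The code below is an illustrative implementation under stated assumptions, not the authors' released code: it assumes NumPy, two series of equal length without ties, a rank-based dependence part equal to (1 - ρ_S)/2 for the sample Spearman correlation, and a discrete Hellinger distance between histograms built on a common grid of bin width `h` (the default `h = 0.1` is an arbitrary choice).

```python
import numpy as np

def gnpr_distance(x, y, theta=0.5, h=0.1):
    """Empirical d_theta between two equally long samples x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    t = len(x)

    # Dependence part: squared differences of ranks, normalized to [0, 1]
    rank_x = np.argsort(np.argsort(x)) + 1
    rank_y = np.argsort(np.argsort(y)) + 1
    d1_sq = 3.0 * float(np.sum((rank_x - rank_y) ** 2)) / (t * (t ** 2 - 1))

    # Distribution part: histograms on the shared grid of bins [k h, (k + 1) h)
    bin_x = np.floor(x / h).astype(int)
    bin_y = np.floor(y / h).astype(int)
    first = min(bin_x.min(), bin_y.min())
    n_bins = max(bin_x.max(), bin_y.max()) - first + 1
    freq_x = np.bincount(bin_x - first, minlength=n_bins) / t
    freq_y = np.bincount(bin_y - first, minlength=n_bins) / t
    d0_sq = 0.5 * float(np.sum((np.sqrt(freq_x) - np.sqrt(freq_y)) ** 2))

    return float(np.sqrt(theta * d1_sq + (1.0 - theta) * d0_sq))

rng = np.random.default_rng(0)
v = rng.standard_normal(5_000)
dependence_only = gnpr_distance(v, 2.0 * v, theta=1.0)    # same ranks
distribution_only = gnpr_distance(v, 2.0 * v, theta=0.0)  # different spreads
```

On this toy pair, the dependence-only distance vanishes because `v` and `2 v` are comonotonic, while the distribution-only distance is strictly positive because their margins differ.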


We will refer henceforth to this distance and its use as the generic non-parametric
representation (GNPR) approach. Using $d_\theta$ and its statistical estimate effectively
boils down to selecting a particular value for $\theta$. We suggest here an exploratory
approach where one can test (i) distribution information ($\theta = 0$), (ii) dependence
information ($\theta = 1$), and (iii) a mix of both ($\theta = 0.5$). Ideally, $\theta$ should
reflect the balance of dependence and distribution information in the data. In a supervised
setting, one could select an estimate $\hat{\theta}$ of the right balance $\theta^\star$
optimizing some loss function by techniques such as cross-validation. Yet, the lack of a clear
loss function makes the estimation of $\theta^\star$ difficult in an unsupervised setting. For
clustering, many authors (Lange et al., 2004), (Shamir et al., 2007), (Shamir et al., 2008),
(Meinshausen and Bühlmann, 2010) suggest stability as a tool for parameter selection. But
(Ben-David et al., 2006) warn against its irrelevant use for this purpose. Besides, we already
use stability for clustering validation and we want to avoid overfitting. Finally, we think
that finding an optimal trade-off $\hat{\theta}$ is important for accelerating the rate of
convergence toward the underlying ground truth when working with finite and possibly small
samples, but ultimately loses its importance asymptotically as soon as $0 < \theta < 1$.


3 Experiments and applications 

3.1 Synthetic datasets 

We propose the following model for testing the efficiency of the GNPR approach: $N$ time
series of length $T$ which are subdivided into $K$ correlation clusters, themselves subdivided
into $D$ distribution clusters.

Let $(Y_k)_{k=1}^K$ be $K$ i.i.d. random variables. Let $p, D \in \mathbb{N}$. Let
$N = pKD$. Let $(Z_d^i)_{d=1}^D$, $1 \le i \le N$, be independent random variables. For
$1 \le i \le N$, we define

$$X_i = \sum_{k=1}^{K} \beta_{k,i}\, Y_k + \sum_{d=1}^{D} \alpha_{d,i}\, Z_d^i, \tag{14}$$

where

a) $\alpha_{d,i} = 1$ if $i \equiv d - 1 \pmod{D}$, 0 otherwise;

b) $\beta \in [0, 1]$;

c) $\beta_{k,i} = \beta$ if $\lceil iK/N \rceil = k$, 0 otherwise.

$(X_i)_{i=1}^N$ are partitioned into $Q = KD$ clusters of $p$ random variables each. Playing
with the model parameters, we define in Table 1 some interesting test case datasets to study
distribution clustering, dependence clustering or a mix of both. We use the following
notations as a shorthand: $\mathcal{L} := \text{Laplace}(0, 1/\sqrt{2})$ and
$\mathcal{S} := t\text{-distribution}(3)/\sqrt{3}$. Since $\mathcal{L}$ and $\mathcal{S}$
both have a mean of 0 and a variance of 1, GPR cannot find any difference between them, but
GNPR can discriminate on higher moments, as can be seen in Fig. 3.
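The generative model can be sketched as follows. This is an illustrative reading of the model under stated assumptions, not the authors' released code: it assumes NumPy, stores the N series as rows of an (N, T) array, and uses hypothetical names (`sample_model`, `normal`, `student`, `laplace`). The example call uses K = 5, D = 2 and β = 0.1 as in test case C, with smaller p and T to keep it quick.

```python
import numpy as np

def sample_model(p, K, D, T, beta, y_sampler, z_samplers, seed=0):
    """Draw N = p*K*D series of length T; return (X, labels) with X of shape (N, T)."""
    rng = np.random.default_rng(seed)
    n = p * K * D
    common = [y_sampler(rng, T) for _ in range(K)]     # Y_1, ..., Y_K
    series = np.empty((n, T))
    labels = np.empty(n, dtype=int)
    for i in range(1, n + 1):
        k = (i * K + n - 1) // n                       # ceil(i K / N): correlation cluster
        d = i % D + 1                                  # i = d - 1 (mod D): distribution cluster
        series[i - 1] = beta * common[k - 1] + z_samplers[d - 1](rng, T)
        labels[i - 1] = (k - 1) * D + (d - 1)          # one label per (k, d) pair
    return series, labels

def normal(rng, T):
    return rng.standard_normal(T)

def student(rng, T):
    return rng.standard_t(3, size=T) / np.sqrt(3.0)                # S: variance 1

def laplace(rng, T):
    return rng.laplace(0.0, 1.0 / np.sqrt(2.0), size=T)            # L: variance 1

X, labels = sample_model(p=4, K=5, D=2, T=1_000, beta=0.1,
                         y_sampler=normal, z_samplers=[normal, student])
```

The returned `labels` give the ground-truth partition into Q = KD clusters of p series each, against which a computed clustering can be scored.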
Figure 3: GPR and GNPR distance matrices. Both GPR and GNPR highlight the 5 correlation
clusters ($\theta = 1$), but only GNPR finds the 2 distributions ($\mathcal{S}$ and
$\mathcal{L}$) subdividing them ($\theta = 0$). Finally, by combining both information GNPR
($\theta = 0.5$) can highlight the 10 original clusters, while GPR ($\theta = 0.5$) simply
adds noise on the correlation distance matrix it recovers.


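Table 1: Model parameters for some interesting test case datasets

Clustering    Dataset  N           T            Q   K   $\beta$  $Y_k$               $Z_1^i$             $Z_2^i$             $Z_3^i$             $Z_4^i$
Distribution  A        200         5000         4   1   0        $\mathcal{N}(0,1)$  $\mathcal{N}(0,1)$  $\mathcal{L}$       $\mathcal{S}$       $\mathcal{N}(0,2)$
Dependence    B        200         5000         10  10  0.1      $\mathcal{S}$       $\mathcal{S}$       $\mathcal{S}$       $\mathcal{S}$       $\mathcal{S}$
Mix           C        200         5000         10  5   0.1      $\mathcal{N}(0,1)$  $\mathcal{N}(0,1)$  $\mathcal{S}$       $\mathcal{N}(0,1)$  $\mathcal{S}$
              G        32,...,640  10,...,2000  32  8   0.1      $\mathcal{N}(0,1)$  $\mathcal{N}(0,1)$  $\mathcal{N}(0,2)$  $\mathcal{L}$       $\mathcal{S}$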

3.2 Performance of clustering using GNPR 


We empirically show that the GNPR approach achieves better results than others using common
distances, regardless of the algorithm used, on the test cases A, B and C described in Table
1. Test case A illustrates datasets containing only distribution information: there are 4
clusters of distributions. Test case B illustrates datasets containing only dependence
information: there are 10 clusters of correlation between random variables which are
heavy-tailed. Test case C illustrates datasets containing both information: it consists of 10
clusters composed of 5 correlation clusters, each of them divided into 2 distribution
clusters. Using the scikit-learn implementation (Pedregosa et al., 2011), we apply 3
clustering algorithms with different paradigms: a hierarchical clustering using average
linkage (HC-AL), k-means++ (KM++), and affinity propagation (AP). Experiment results are
reported in Table 2. GNPR performance is due to its proper representation (cf. Fig. 4).
Finally, we have noticed increasing precision of clustering using GNPR as time $T$ grows to
infinity, all other parameters being fixed. The number of time series $N$ seems rather
uninformative, as illustrated in Fig. 5 (left), which plots ARI (Hubert and Arabie, 1985)
between computed clustering and ground-truth of dataset G as a heatmap for varying $N$ and
$T$. Fig. 5 (right) shows the convergence to the true clustering as a function of $T$.
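For readers unfamiliar with the score reported in these experiments, the adjusted Rand index can be computed from the contingency counts of two partitions. The sketch below is a plain standard-library implementation of the usual Hubert and Arabie formula, given here for illustration only; the experiments themselves rely on scikit-learn, and the function name is an assumption.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions given as label sequences."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    # Number of pairs of items grouped together in both, or in each, partition
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / comb(n, 2)   # value under random labelling
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        return 1.0                                          # degenerate partitions
    return (together_both - expected) / (maximum - expected)

same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # labels permuted
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])          # no pair preserved
```

The index only depends on which items are grouped together, so relabelling clusters leaves it unchanged, and it is centred so that random partitions score around 0.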


3.3 Application to financial time series clustering 
3.3.1 Clustering assets: a (too) strong focus on correlation 

It has been noticed that straightforward approaches automatically discover sectors and
industries (Mantegna, 1999). Since detected patterns are blatantly correlation-flavoured, it
prompted econophysicists to focus on correlations, hierarchies and networks (Tumminello et
al., 2010), from the Minimum Spanning Tree and its associated clustering algorithm, the Single
Linkage, to the state of the art (Musmeci et al., 2014) exploiting the topological properties
of the Planar Maximally Filtered
Figure 4: Distance matrices obtained on dataset C using distance correlation, $L_2$ distance,
GPR and GNPR. None but GNPR highlights the 10 original clusters which appear on its diagonal.


Table 2: Comparison of distance correlation, $L_2$ distance, GPR and GNPR: GNPR approach
improves clustering performance

                                  Adjusted Rand Index
Algo.   Distance                  A             B             C
HC-AL   $(1-\rho)/2$              0.00 ± 0.01   0.99 ± 0.01   0.56 ± 0.01
        $\mathbb{E}[(X-Y)^2]$     0.00 ± 0.00   0.09 ± 0.12   0.55 ± 0.05
        GPR $\theta = 0$          0.34 ± 0.01   0.01 ± 0.01   0.06 ± 0.02
        GPR $\theta = 1$          0.00 ± 0.01   0.99 ± 0.01   0.56 ± 0.01
        GPR $\theta = .5$         0.34 ± 0.01   0.59 ± 0.12   0.57 ± 0.01
        GNPR $\theta = 0$         1             0.00 ± 0.00   0.17 ± 0.00
        GNPR $\theta = 1$         0.00 ± 0.00   1             0.57 ± 0.00
        GNPR $\theta = .5$        0.99 ± 0.01   0.25 ± 0.20   0.95 ± 0.08
KM++    $(1-\rho)/2$              0.00 ± 0.01   0.60 ± 0.20   0.46 ± 0.05
        $\mathbb{E}[(X-Y)^2]$     0.00 ± 0.00   0.34 ± 0.11   0.48 ± 0.09
        GPR $\theta = 0$          0.41 ± 0.03   0.01 ± 0.01   0.06 ± 0.02
        GPR $\theta = 1$          0.00 ± 0.00   0.45 ± 0.11   0.43 ± 0.09
        GPR $\theta = .5$         0.27 ± 0.05   0.51 ± 0.14   0.48 ± 0.06
        GNPR $\theta = 0$         0.96 ± 0.11   0.00 ± 0.01   0.14 ± 0.02
        GNPR $\theta = 1$         0.00 ± 0.01   0.65 ± 0.13   0.53 ± 0.02
        GNPR $\theta = .5$        0.72 ± 0.13   0.21 ± 0.07   0.64 ± 0.10
AP      $(1-\rho)/2$              0.00 ± 0.00   0.99 ± 0.07   0.48 ± 0.02
        $\mathbb{E}[(X-Y)^2]$     0.14 ± 0.03   0.94 ± 0.02   0.59 ± 0.00
        GPR $\theta = 0$          0.25 ± 0.08   0.01 ± 0.01   0.05 ± 0.02
        GPR $\theta = 1$          0.00 ± 0.01   0.99 ± 0.01   0.48 ± 0.02
        GPR $\theta = .5$         0.06 ± 0.00   0.80 ± 0.10   0.52 ± 0.02
        GNPR $\theta = 0$         1             0.00 ± 0.00   0.18 ± 0.01
        GNPR $\theta = 1$         0.00 ± 0.01   1             0.59 ± 0.00
        GNPR $\theta = .5$        0.39 ± 0.02   0.39 ± 0.11   1


Graph (Tumminello et al., 2005) and its associated algorithm, the Directed Bubble Hierarchical
Tree (DBHT) technique (Song et al., 2012). In practice, econophysicists consider the assets
log returns and compute their correlation matrix. The correlation matrix is then filtered
thanks to a clustering of the correlation network (Di Matteo et al., 2010) built using
similarity and dissimilarity matrices which are derived from the correlation one by convenient
ad hoc transformations. Clustering these correlation-based networks (Onnela et al., 2004) aims
at filtering the correlation matrix for standard portfolio optimization (Tola et al., 2008).
Yet, adopting similar approaches only allows one to retrieve information given by assets
co-movements and nothing about the specificities of their returns behaviour, whereas we may
also want to distinguish assets by their returns distribution. For example, we are interested
in knowing whether they undergo fat tails, and to what extent.


3.3.2 Clustering credit default swaps 

We apply the GNPR approach on financial time series, namely daily credit default swap (Hull,
2006) (CDS) prices. We consider the $N = 500$ most actively traded CDS according to DTCC
(http://www.dtcc.com/). For each CDS, we have $T = 2300$ observations corresponding to
historical daily prices over the last 9 years, amounting to more than one million data points.
Since credit default swaps are traded over-the-counter, closing time for fixing prices can be
arbitrarily chosen, here 5pm GMT, i.e. after the London Stock Exchange trading session. This
synchronous fixing of CDS prices avoids spurious correlations arising from different closing
times. For example, the use of close-to-close stock prices artificially overestimates
intra-market correlation and underestimates inter-market dependence since they have different
trading hours (Martens and Poon, 2001). These CDS time series can be consulted on the web
portal http://www.datagrapple.com/.

Figure 5: Empirical consistency of clustering using GNPR as $T \to \infty$ (panel title:
Clustering convergence to the ground-truth partition).


Assuming that CDS prices follow random walks, their increments
$\Delta P^t := P^t - P^{t-1}$ are i.i.d. random variables, and therefore the GNPR approach can
be applied to the time series of price variations, i.e. on data
$(\Delta P_1^t, \ldots, \Delta P_N^t)$, $t = 1, \ldots, T$. Thus, for aggregating CDS price
time series, we use a clustering algorithm (for instance, Ward’s method (Ward, 1963)) based on
the GNPR distance matrices between their variations.
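The pre-processing step described here can be sketched as follows, assuming NumPy and a hypothetical `prices` array with one row per day and one column per CDS. The helper names are illustrative, and `dist` stands for any distance between two series, such as an implementation of the GNPR estimate; the toy distance used in the example is a placeholder only.

```python
import numpy as np

def price_increments(prices):
    """Turn a (T + 1, N) array of prices into a (T, N) array of P^t - P^(t-1)."""
    return np.diff(np.asarray(prices, dtype=float), axis=0)

def pairwise_distance_matrix(series, dist):
    """Symmetric (N, N) matrix of dist between the N columns of a (T, N) array."""
    n = series.shape[1]
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i, j] = matrix[j, i] = dist(series[:, i], series[:, j])
    return matrix

# Toy data: 3 days of prices for 2 instruments
toy_prices = [[1.0, 10.0],
              [2.0, 12.0],
              [4.0, 11.0]]
toy_increments = price_increments(toy_prices)

# Placeholder distance for the example; a GNPR implementation would go here
toy_matrix = pairwise_distance_matrix(
    toy_increments, lambda a, b: float(np.max(np.abs(a - b))))
```

The resulting matrix can then be passed to any clustering routine that accepts precomputed pairwise distances.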


Using GNPR $\theta = 0$, we look for distribution information in our CDS dataset. We observe
that clustering based on the GNPR $\theta = 0$ distance matrix yields 4 clusters which fit
precisely the multi-modal empirical distribution of standard deviations, as can be seen in
Fig. 6. For GNPR $\theta = 1$, we display in Fig. 7 the rank correlation distance matrix
obtained. We can notice its hierarchical structure, already described in many papers focusing
on stock markets, e.g. (Mantegna, 1999), (Brida and Risso, 2010). There is information in
distribution and in correlation, thus taking into account both information, i.e. using GNPR
$\theta = 0.5$, should lead to a meaningful clustering. We verify this claim by using
stability as a criterion for validation. Practically, we consider even and odd trading days
and perform two independent clusterings, one on even days and the other one on odd days. We
should obtain the same partitions. In Fig. 8, we display the partitions obtained using the
GNPR $\theta = 0.5$ approach next to the ones obtained by applying an $L_2$ distance on price
returns. We find that GNPR clustering is more stable than $L_2$ on returns clustering.
Moreover, clusters obtained from GNPR are more homogeneous in size.


To conclude on the experiments, we have highlighted through clustering that the presented
approach leveraging dependence and distribution information leads to better results: finer
partitions on synthetic test cases and more stable partitions on financial time series.


4 Discussion 

In this paper, we have presented a novel representation of random variables which could lead
to improvements in applying machine learning techniques on time series describing underlying
i.i.d. stochastic processes. We have empirically shown its relevance to deal with random walks
and financial time series. We have conducted a large-scale experiment on the credit
derivatives market, notorious for not having Gaussian but heavy-tailed returns; first results
are available on the website www.datagrapple.com. We also intend to conduct such clustering
experiments to test the applicability of the method to areas outside finance. On the
theoretical side, we plan to improve the aggregation of the correlation and distribution parts
by using elements of information geometry theory and to study the consistency property of our
method.

Figure 6: Standard Deviation Histogram. The 4 clusters found using GNPR $\theta = 0$,
represented by the 4 colors, fit precisely the multi-modal distribution of standard
deviations.

Figure 7: Centered Rank Correlation Distance Matrix. GNPR $\theta = 1$ exhibits a hierarchical
structure of correlations: first level consists of Europe, Japan and US; second level
corresponds to credit quality (investment grade or high yield); third level to industrial
sectors.

Figure 8: Better clustering stability using the GNPR approach: GNPR $\theta = 0.5$ achieves
ARI = 0.85; $L_2$ on returns achieves ARI = 0.64. The two leftmost partitions built from GNPR
on the odd/even trading days sampling look similar: only a few CDS are switching from
clusters. The two rightmost partitions built using an $L_2$ on returns display very
inhomogeneous (odd-2,3,9 vs. odd-4,14,15) and unstable (even-1 splitting into odd-3 and odd-2)
clusters.